
ENH/API: Add count parameter to limit generator in Series, DataFrame, and DataFrame.from_records() #5898

Closed
wants to merge 1 commit

Conversation

tinproject
Contributor

When reading data from a generator-type collection, only the first count values are read.
This is a first step toward solving #2305: knowing the length of the data up front allows the memory to be allocated in advance.

  • Add a count parameter to Series, DataFrame, and DataFrame.from_records().
  • In DataFrame.from_records(), deprecate the existing nrows parameter. count is more general and refers only to the quantity of data items; it also exists in the numpy API (fromiter).
  • Some refactoring in DataFrame.from_records().
  • Tests added.
  • Release docs are still missing.

Add count parameter to Series, DataFrame, and DataFrame.from_records(). When reading data from a
generator-type collection, only the first count values are read.
Some refactoring in DataFrame.from_records(). Tests added.
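As a point of reference, numpy's fromiter (cited above as precedent) already accepts a count argument; a minimal illustration of the intended behavior:

```python
import numpy as np
from itertools import count

# An unbounded generator of squares; fromiter stops after `count` items
# and can pre-allocate the output because the size is known up front.
squares = (i * i for i in count())
arr = np.fromiter(squares, dtype=np.int64, count=5)
# arr is array([0, 1, 4, 9, 16]); the generator is never materialized as a list
```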
@jreback
Contributor

jreback commented Jan 10, 2014

can u give an actual use case of this?

@tinproject
Contributor Author

You can build random walks from an infinite random-number generator.
If you have a big file and a generator that yields each processed line, you can limit the number of lines read.

It's focused on solving #2305: to load data directly and in an efficient manner you need to know how much memory you will need, and generators and iterators are of indefinite length, so additional help is needed.

@jreback
Contributor

jreback commented Jan 10, 2014

not a big fan of adding a keyword to the constructors

maybe a better way is to allow data to be a callable, then you can embed islice if you wanted to
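A sketch of that alternative: no new keyword is needed if the caller caps the generator with itertools.islice before handing it to the constructor (plain Python shown here, without pandas):

```python
from itertools import count, islice

# Unbounded generator; islice caps consumption at 5 items,
# so a constructor would only ever see a finite iterator.
gen = (i * 2 for i in count())
limited = list(islice(gen, 5))
# limited == [0, 2, 4, 6, 8]
```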

@jtratner
Contributor

At the very least, better to make this a classmethod

@ghost

ghost commented Jan 18, 2014

me third, wrong approach. Supporting reduced-memory data loading from an iterator of known length
is not going to happen by extending existing methods.

viz. recent discussion in #2193, we like the idea done as a new class method.
Overview:

  1. take a count
  2. infer dtypes from first line (or a few if nans encountered)
  3. pre-allocate numpy arrays of known size and dtype
  4. consume the iterator and fill in the preallocated array
  5. pass the arrays or whatever works to reuse the arrays as the underlying
    data for a pandas data object.

I planned to try this myself for 0.14, but if you beat me to it so much the better.
If block manager confuses you, do 1-4+tests and we can take it from there.
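A rough sketch of steps 1-4 above (the function name is illustrative, not pandas API; the dtype is inferred from the first item only, without the NaN fallback mentioned in step 2):

```python
import numpy as np

def fromiter_prealloc(it, count):
    """Steps 1-4: take a count, infer the dtype from the first item,
    pre-allocate an ndarray of known size, then fill it from the iterator."""
    it = iter(it)
    first = next(it)                                      # step 2: infer dtype
    arr = np.empty(count, dtype=np.asarray(first).dtype)  # step 3: pre-allocate
    arr[0] = first
    n = 1
    for x in it:                                          # step 4: consume and fill
        if n == count:
            break
        arr[n] = x
        n += 1
    return arr[:n]  # trim (a view, no copy) if the iterator ran short
```

Step 5 would then hand the filled array (or arrays) to the block manager as the underlying data.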

This I'm closing.

@ghost ghost closed this Jan 18, 2014
@ghost

ghost commented Jan 18, 2014

p.s.
If you nail down a solid, reproducible way of measuring the memory allocation difference, even better.
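One reproducible approach (a sketch of my own, not from the thread) is the stdlib tracemalloc module; note that NumPy only reports its buffer allocations to tracemalloc in reasonably recent versions:

```python
import tracemalloc
import numpy as np

def peak_bytes(build):
    """Peak traced allocation while running build()."""
    tracemalloc.start()
    build()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

n = 100_000
# Current path: materialize a list of Python ints, then convert.
via_list = peak_bytes(lambda: np.array(list(range(n)), dtype=np.int64))
# Proposed path: consume the iterator straight into a pre-allocated array.
via_fromiter = peak_bytes(lambda: np.fromiter(range(n), dtype=np.int64, count=n))
print(via_list, via_fromiter)  # the list path should peak noticeably higher
```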

@jtratner
Contributor

side note - why is it useful to not create an intermediate list or ndarray? I assure you that even if you pass in a generator, it's eventually going to be converted into a list or ndarray, then have its types massaged, etc. The sugar of pandas does sacrifice some memory efficiency in loading and manipulating data. It's a tradeoff, and sometimes you might need to stick with numpy for really performance-critical parts.

@ghost

ghost commented Jan 18, 2014

@jtratner , are you asking this having read through the discussion in #2193?

@tinproject
Contributor Author

@y-p go for it. Currently I'm only a hobby programmer and don't have much time available to make it, but I'm happy to help with anything in my little spare time.

I have been thinking about this problem of memory consumption for months, and I came to the same task list as you; this PR actually matches task number 1, it was written for it.

I made this during the holidays. Because I didn't have enough time to write all the code, I filed GH5902 to try to express my ideas on the rest of the process. Unluckily I'm not a native English speaker and I don't express myself as well as I'd like to.

This PR is focused on the API changes: the addition of a count parameter. Maybe I aimed at too broad an inclusion of the parameter for the sake of consistency, but in any case you are going to need a count parameter, because generators/iterators are not sized objects: they have no __len__(), so the length must be given explicitly.
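The sizing point can be checked directly: generators expose no __len__, so a constructor cannot size an allocation from them (a small stdlib-only check):

```python
def gen():
    yield from range(3)

g = gen()
assert not hasattr(g, "__len__")      # len(g) would raise TypeError
assert hasattr([0, 1, 2], "__len__")  # lists, by contrast, are sized
```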

On measuring memory consumption:

Some time ago I tried IPython notebooks and %memit, but I prefer to use common sense.

  • Imagine you have a 1000-number generator that yields Python int objects. Pandas puts the whole generator into a list of Python int objects, then puts the objects of the list inside an ndarray. Result: twice the needed memory is used, 1000 x PythonIntSize + 1000 x dtypeSize + ndarray overhead.
  • The same generator, this time read in chunks of 100 objects (when reading in chunks you don't need to know the length of the generator). Consume a chunk's worth of objects from the generator, put them in a list, and then put the chunk into an ndarray. When all the chunks are in ndarray form, combine them into one single ndarray. Depending on the number of objects and their overhead this may use a little less or a little more memory than the previous point, but it is of the same order: 1 x 100 x PythonIntSize + 10 x (100 x dtypeSize + ndarray overhead) + 1000 x dtypeSize + ndarray overhead.
  • The same generator, one more time. This time you know the length of the data and the dtype, so you can allocate the ndarray and put the data yielded by the generator into it one by one. Result: only the needed memory is used, 1 x PythonIntSize + 1000 x dtypeSize + ndarray overhead.

If the generator yields more results than count, you can simply ignore them. If the generator is exhausted before count, resize the ndarray to the correct size. As this only shrinks the ndarray, there are no copy/move operations in memory, so no performance penalty.
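For comparison, numpy's fromiter handles these two cases a little differently from the behavior described above (assuming current NumPy): surplus items are simply left unconsumed in the generator, while an early-exhausted iterator raises instead of being trimmed:

```python
import numpy as np

gen = iter(range(10))
arr = np.fromiter(gen, dtype=np.int64, count=4)
# arr is array([0, 1, 2, 3]); items 4..9 are left unconsumed in `gen`
leftover = list(gen)

# When the iterator runs short of `count`, fromiter raises rather than trimming.
try:
    np.fromiter(iter(range(3)), dtype=np.int64, count=5)
    raised = False
except ValueError:
    raised = True
```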

This pull request was closed.